First File

Testing Quarto
Author

Your Name

Published

July 26, 2023

Modified

July 26, 2023

Introduction of the Social Capital Atlas

The dataset being analyzed in the final project is the Social Capital Atlas. This dataset comes from Data for Good at Meta (who is known to own Facebook and Instagram) and is organized into separated .csv files to analyze various US counties, zip codes, colleges, and high schools to measure three main measurements. Firstly, the economic connectedness measures the number of shared friends between people of different socioeconomic statuses. The second measure is the cohesiveness, and how much friends tend to be supported by other’s mutual friends. Finally, the civic engagem ent of each area is measured or the rate at which they volunteer.

Motivation

Growing up, I noticed economic disparsity within my own hometown and how that played with relationship circles. I grew up in a sub-urban beach town called Ventura, about an hour away from Los Angeles county. It was generally a quiet town; however, I noticed some economic and racial disparsity, with certain communities being separated from the rest. I especially noticed this in friend groups within public high schools. Those who were more wealthy and lived in more expensive areas typically congregated together, while those who lived in less expensive areas were excluded from these groups and instead congregated amongst themselves. To my knowledge, most of the causes for exclusion were not from direct bullying but just difference in township sections (otherwise known as where people lived) and general community expectations.

While I am unable to explore racial disparity in this dataset, I am able to see if this economic disparity affects friend groups through the country through this dataset. It will also be interesting to recognize whether higher rates of inclusion lead to higher rates of volunteer efforts, as it may assume a set of values within a community that truly sets it apart, or if it is all randomized.

Research Questions

  1. How does the connectedness between individuals with varying socioeconomic statuses relate to the cohesiveness of friend groups?

  2. To what extent does higher social cohesion in a society influence rates of civic engagement?

Methods

Exploration

Here we’ll be exploring summary information about the data, summary statistics, identify any missing/incorrect/outlier data, and conduct data preprocessing if necessary.

We’ll also cover 4 exploratory data visualizations to better understand its structure, contents, and whether the data is clean enough to continue into the analysis.

import pandas as pd
import matplotlib.pyplot as plt
import folium

df = pd.read_csv('https://data.humdata.org/dataset/85ee8e10-0c66-4635-b997-79b6fad44c71/resource/ec896b64-c922-4737-b759-e4bd7f73b8cc/download/social_capital_county.csv')
df
county county_name num_below_p50 pop2018 ec_county ec_se_county child_ec_county child_ec_se_county ec_grp_mem_county ec_high_county ... child_exposure_county child_high_exposure_county bias_grp_mem_county bias_grp_mem_high_county child_bias_county child_high_bias_county clustering_county support_ratio_county volunteering_rate_county civic_organizations_county
0 1001 Autauga, Alabama 5922.39210 55200.0 0.72077 0.00831 1.11754 0.02467 0.77223 1.21372 ... 1.14816 1.19944 0.05526 -0.22748 0.02668 -0.08229 0.10347 0.98275 0.04355 0.01518
1 1003 Baldwin, Alabama 15458.39600 208107.0 0.74313 0.00661 0.83064 0.01629 0.76215 1.28302 ... 0.84588 1.00797 0.02950 -0.21519 0.01802 -0.05241 0.09624 0.98684 0.06117 0.01526
2 1005 Barbour, Alabama 4863.97360 25782.0 0.41366 0.00978 0.58541 0.02707 0.35927 0.91897 ... 0.63306 0.71967 0.13457 -0.34086 0.07528 -0.19714 0.14911 0.99911 0.02093 0.01474
3 1007 Bibb, Alabama 3061.49340 22527.0 0.63152 0.01175 0.72265 0.03027 0.68094 1.06378 ... 0.71433 0.72395 0.04108 -0.27727 -0.01165 -0.15993 0.14252 0.99716 0.05294 0.01439
4 1009 Blount, Alabama 6740.91160 57645.0 0.72562 0.00985 0.76096 0.02466 0.79584 1.10569 ... 0.74821 0.79375 0.00217 -0.24946 -0.01704 -0.08745 0.11243 0.99069 0.05704 0.01724
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3084 56037 Sweetwater, Wyoming 2402.96900 44117.0 0.96235 0.01280 1.14781 0.02794 1.13449 1.32399 ... 1.12164 1.12907 0.09519 -0.12030 -0.02333 -0.08683 0.10809 0.99710 0.07321 0.01225
3085 56039 Teton, Wyoming 783.24982 23059.0 1.07623 0.01744 1.23113 0.04692 1.13296 1.63551 ... 1.32874 1.35341 0.14337 -0.11958 0.07346 -0.07364 0.09253 0.98648 0.09747 0.03223
3086 56041 Uinta, Wyoming 2174.06180 20609.0 0.95452 0.01404 1.04595 0.03455 0.92831 1.32040 ... 1.05446 1.06284 0.13816 -0.12194 0.00808 -0.06074 0.11204 0.99479 0.06942 0.01222
3087 56043 Washakie, Wyoming 872.51544 8129.0 0.90667 0.01928 0.90794 0.04962 0.78223 1.29208 ... 0.88480 0.88589 0.06667 -0.20435 -0.02615 -0.06076 0.11592 0.99708 0.05843 0.03512
3088 56045 Weston, Wyoming 635.28436 7100.0 0.97840 0.02036 1.09118 0.05823 0.93135 1.28553 ... 1.03325 1.05526 0.02279 -0.17229 -0.05606 -0.04609 0.11927 0.99730 0.13635 0.02375

3089 rows × 26 columns

Summary Information

Every county that had statistics via Facebook data were included in the dataset. While there are many columns, many of them will not be used for the final analysis since they are irrelevant to the research questions. Here is a summary of what columns this project will utilize, as well as explanation for why this project will not utilize the others.

ec_county - This measures the level of economic connectedness within a county. The equation for this is two times the share of high-SES (social economic status) friends amongst low-SES individuals and then averaged. This equation is from the research journal “Social capital I: Measurement and Associations with Economic Mobility.”
ec_high_county - This measures the connectedness between high-SES individuals solely, once again with a value that is averaged. The main difference is that this does not include how connected they are with people from lower economic statuses.
exposure_grp_mem_county - This measures how often high-SES individuals are exposed to low-SES individuals with the same formula of from ec_county.
exposure_grp_mem_high_county - This measures how often high-SES individuals are exposed to only high-SES individuals, once again measured with a similar formula.
clustering_county - This calculates the average fraction of an individual’s friend pairs who are also friends with each other including people only within the relevant county.
volunteering_rate_county - The percentage of Facebook users within a county that are predicted to be members of a volunteering or activism group. Secret groups or large groups that Facebook identified as “clearly misclassified” were not included in this calculation.

There are many other columns. One of the main categories include those that revolve around childhood connectedness specifically. These are not calculated because the research question aims to address all populations. Additionally, the two categories, the support_ratio_county, which measures the proportion of people who share a third mutual friend, and so was civic_organizations_county, which measures the number of Facebook Pages predicted to be “Public Good” within a county, are not included as they would not address the research questions effectively. As part of the cleaning process, these categories will be removed. The standard error colums will be kept, as they may need to be referred to in the conclusion.

Summary Statistics

columns_removed = ['num_below_p50', 'pop2018', 'child_ec_county', 'child_ec_se_county', 'ec_grp_mem_county', 'child_high_ec_county', 'child_high_ec_se_county', 'ec_grp_mem_high_county','child_exposure_county','child_high_exposure_county','bias_grp_mem_county','bias_grp_mem_high_county','child_bias_county','child_high_bias_county', 'support_ratio_county', 'civic_organizations_county']

df = df.drop(columns=columns_removed)

column_max = df.max()
column_min = df.min()
column_stats_df = pd.DataFrame({'Column': df.columns, 'Minimum': column_min, 'Maximum': column_max})

print(column_stats_df)
                                                    Column  \
county                                              county   
county_name                                    county_name   
ec_county                                        ec_county   
ec_se_county                                  ec_se_county   
ec_high_county                              ec_high_county   
ec_high_se_county                        ec_high_se_county   
exposure_grp_mem_county            exposure_grp_mem_county   
exposure_grp_mem_high_county  exposure_grp_mem_high_county   
clustering_county                        clustering_county   
volunteering_rate_county          volunteering_rate_county   

                                                Minimum                Maximum  
county                                             1001                  56045  
county_name                   Abbeville, South Carolina  Ziebach, South Dakota  
ec_county                                       0.29469                 1.3597  
ec_se_county                                    0.00436                0.05023  
ec_high_county                                  0.70062                1.71507  
ec_high_se_county                               0.00475                0.05099  
exposure_grp_mem_county                          0.2552                1.48628  
exposure_grp_mem_high_county                    0.51013                1.66616  
clustering_county                               0.07162                0.26097  
volunteering_rate_county                        0.00965               0.308736  

This is the following cell range for each column:
ec_county - 0.29469 - 1.3597. The lower the number is, the less connectedness there is between high-SES and low-SES individuals. The standard deviation ranges from around 0.004-0.05 per county.
ec_high_county - 0.70062 - 1.71507. The lower the number is, the less connectedness there is between high-SES individuals. The standard deviation for this ranges from 0.004-0.05 per county, similar to the original ec_county statistic.
exposure_grp_mem_county - 0.2552-1.48628. The lower the number is, the less exposure people of high-SES to low-SES individuals.
exposure_grp_mem_high_county - 0.51013-1.66616. The lower the number is, the less exposure high-SES people have to other high-SES individuals.
clustering_county - 0.07162-0.26097. The lower the number equates to a lower amount of mutual friend circles within the county.
volunteering_rate_county - 0.00965 - 0.308736. The lower the number is, the less volunteers the county has.

We will also note the mean (average) values for future reference.

mean_values = df.mean(numeric_only=True)
print(mean_values)
county                          30218.783101
ec_county                           0.814464
ec_se_county                        0.013409
ec_high_county                      1.252636
ec_high_se_county                   0.014754
exposure_grp_mem_county             0.906089
exposure_grp_mem_high_county        1.078581
clustering_county                   0.116456
volunteering_rate_county            0.078068
dtype: float64

Data Exploration

Before analyzing the data, we should identify and understand the dataset on these following bases.
    1) Is there any relationship between economic connectedness between low and high SES and just high SES? If so, what’s the pattern? Do those categories seem to have a balanced distribution, or are there more counties that are on the lower or higher end of the spectrum?
    2) What do the levels of clustering_county, or fraction of friend_pairs, look like around the country? Is there any clear average?
    3) What do the levels of volunteering rates look like around the country? Is there any clear average?

#### Data Exploration #1

df.plot.scatter(x='ec_county', y='ec_high_county')
plt.xlabel("Economic Connectedness between low and high SES")
plt.ylabel("Connectedness within high SES")
plt.title("Relationship between Economic Connectedness of Low and High SES and Economic Connectedness within High SES")
plt.show()

This scatterplot aims to recognize if there’s any correlation between connectedness between low and high SES and just high SES individuals.

It seems there is a somewhat balanced, upward linear pattern between the two indicators. For the most part, the higher the economic connectedness between low and high SES indicates it’ll also be higher between those in just higher SES. There are some outliers, but none that indicate something incredibly wrong with the data. It tends to fall within the standard deviation predicted above. It is important to note that most of the outliers tend to fall with higher scores on the higher connectedness within just high-SES individuals, indicating that many counties do have a tendency to prioritize connectedness within one’s own social class rather than a variety. It’ll be interesting to see if this correlates with friendship circles as we get into the analysis.

While it looks like the economic connectedness is relatively evenly distributed country-wide, a histogram should be conducted to double check on the categories.
#### Data Exploration #2

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)  
plt.hist(df['ec_county'], bins=10, range=(0.2, 1.4), edgecolor='black')
plt.xlabel("Economic Connectedness per ec_county")
plt.ylabel("Frequency")
plt.title("Histogram of Economic Connectedness Between Low and High-SES")

plt.subplot(1, 2, 2)  
plt.hist(df['ec_high_county'], bins=10, range=(0.7, 1.7), edgecolor='black')
plt.xlabel("Economic Connectedness per ec_high_county")
plt.ylabel("Frequency")
plt.title("Histogram of Economic Connectedness for only High-SES")

plt.subplots_adjust(wspace=0.4)
plt.show()

The original prediction of even distribution across the economic connectedness scores is incorrect.

The economic connectedness between low and high SES appears to be on somewhat a bell-curve, which means that there is higher frequency in the mid-range of the distribution with a score of around 0.8. This also happens to be the average or mean score. In simpler terms, this means that counties tend to have a moderate level of connectedness between people of different socioeconomic status.

On the other hand, the economic connectedness amongst only those of high-SES is slightly skewed to the higher range of the data. This means that those within the higher socioeconomic category tend to have stronger connections. This might also explain why in the scatterplot most of the deviations were indicated higher connectedness with only high-SES individuals.
#### Data Exploration #3

def extract_county_name(name): ## couldn't find a .geojson that had the counties lists as (County Name, State Name), like the dataset, hence the function
    parts = name.split(',')
    if len(parts) > 1:
        return parts[0].strip()  
    else:
        return name

df['county_name'] = df['county_name'].apply(extract_county_name)

m = folium.Map(location=[36, -98], zoom_start=4)

county_geo_url = "https://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_20m.json"

folium.Choropleth(
    geo_data=county_geo_url,
    data=df,
    columns=['county_name', 'clustering_county'],
    key_on='feature.properties.NAME',  
    fill_color='YlGnBu',
).add_to(m)

m
Make this Notebook Trusted to load map: File -> Trust Notebook

In the main parts of the United States, most counties tend to be yellow or a ligher green, with some even getting into an aquamarine panel. Nevertheless, it’s important to note that the lighter green demonstrates relatively low clusters. Another histogram will be essential to confirm this.

As an interesting note, clusters are extremely frequent in Alaskan counties with many of the counties being aquamarine, medium blue, or dark blue. This might be because the population in Alaska is lower, so communities may be more tightly knit.
#### Data Exploration #4

df.hist(column='clustering_county')
plt.xlabel("Frequency of Clusters / Mutual Connections within every County")
plt.ylabel("Frequency")
plt.title("Frequency of Relationship Cohesiveness within every County")
plt.show()

As predicted, clusters tend to be less frequent within each county, with the histogram heavily skewed to the left.

Data Exploration #5

m = folium.Map(location=[36, -98], zoom_start=4)

county_geo_url = "https://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_20m.json"

folium.Choropleth(
    geo_data=county_geo_url,
    data=df,
    columns=['county_name', 'volunteering_rate_county'],
    key_on='feature.properties.NAME',  
    fill_color='YlOrRd',
).add_to(m)

m
Make this Notebook Trusted to load map: File -> Trust Notebook

This map seems to be more evenly distributed. Many of the states seem to be yellow to light orange, with a few noticeable ones in the reds. However, an important note is that the west (except for California) midwest, and northeast tend to have more counties in the orange range. Alaska and Hawaii entirely seem to be orange to dark orange as well.

This should also be confirmed with a histogram, but from this map it seems that the rate will be similarly skewed as the cluster and cohesiveness rate.
#### Data Visualization #6

df.hist(column='volunteering_rate_county')
plt.xlabel("Frequency of Clusters / Mutual Connections within every County")
plt.ylabel("Frequency")
plt.title("Frequency of Relationship Cohesiveness within every County")
plt.show()

The above prediction was correct.

Data Cleaning

As a final step to the process, we will do some final data cleaning. There is no huge outlier, so worrying about that is no problem. However, as identified in the maps, there are some counties with no data. We will first identify whether there are any rows with missing values - if not, this would mean that the values did not correctly match up with the .geojson.

missing_values = df[df.isnull().any(axis=1)]
print(missing_values)
      county         county_name  ec_county  ec_se_county  ec_high_county  \
52      1105               Perry        NaN           NaN             NaN   
71      2060         Bristol Bay        NaN           NaN             NaN   
82      2164  Lake and Peninsula        NaN           NaN             NaN   
186     6003              Alpine        NaN           NaN             NaN   
255     8023            Costilla        NaN           NaN             NaN   
...      ...                 ...        ...           ...             ...   
2707   48433           Stonewall        NaN           NaN             NaN   
2713   48447        Throckmorton        NaN           NaN             NaN   
2748   49009             Daggett        NaN           NaN             NaN   
2759   49031               Piute        NaN           NaN             NaN   
2831   51091            Highland        NaN           NaN             NaN   

      ec_high_se_county  exposure_grp_mem_county  \
52                  NaN                      NaN   
71                  NaN                      NaN   
82                  NaN                      NaN   
186                 NaN                      NaN   
255                 NaN                      NaN   
...                 ...                      ...   
2707                NaN                      NaN   
2713                NaN                      NaN   
2748                NaN                      NaN   
2759                NaN                      NaN   
2831                NaN                      NaN   

      exposure_grp_mem_high_county  clustering_county  \
52                             NaN           0.198011   
71                             NaN           0.133101   
82                             NaN           0.156810   
186                            NaN           0.091637   
255                            NaN           0.100376   
...                            ...                ...   
2707                           NaN           0.133058   
2713                           NaN           0.105239   
2748                           NaN           0.094783   
2759                           NaN           0.123849   
2831                           NaN           0.152162   

      volunteering_rate_county  
52                    0.031368  
71                    0.042672  
82                    0.086069  
186                   0.146537  
255                   0.062568  
...                        ...  
2707                  0.033803  
2713                  0.244292  
2748                  0.026229  
2759                  0.030337  
2831                  0.188111  

[77 rows x 10 columns]

While the ec_county is missing, none of the values are missing from clustering_county or volunteering_rate_county. This means certain values did not match up with the .geojson file. We will check for these values.

import requests
response = requests.get(county_geo_url)
county_geo_data = response.json()

geojson_names = set(feature['properties']['NAME'] for feature in county_geo_data['features'])
df_names = set(df['county_name'])
missing_names = df_names - geojson_names
print("Missing county names:", missing_names)
Missing county names: {'St. Louis County', 'Franklin County', 'St. Louis City', 'Richmond City', 'Roanoke City', 'Baltimore City', 'Baltimore County', 'Fairfax County', 'Ste Genevieve', 'Roanoke County', 'Richmond County', 'Bedford County', 'DoÃ\x83±a Ana'}

These are the county names that weren’t addressed in the .geojson.

Many of these have different names in the main dataset. These seem to be the common errors. For example, Richmond County is just named Richmond.
    1) Counties like “Richmond County” that have no city are simply just named “Richmond” in the dataset. Some also have periods, such as “Ste. Genevieve.”
    2) Differentiating between a city and county is not indicated in the name of the geojson.

We will fix all of these, and then make sure we do not have any more missing names.

name_mapping = {
    "0500000US51161": {"NAME": "Roanoke County"},
    "0500000US51770": {"NAME": "Roanoke City"},
    "0500000US24510": {"NAME": "Baltimore City"},
    "0500000US29189": {"NAME": "St. Louis County"},
    "0500000US51059": {"NAME": "Fairfax County"},
    "0500000US24005": {"NAME": "Baltimore County"},
    "0500000US51019": {"NAME": "Bedford County"},
    "0500000US35013": {"NAME": "DoÃ\x83±a Ana"},
    "0500000US51159": {"NAME": "Richmond County"},
    "0500000US51760": {"NAME": "Richmond City"},
    "0500000US29510": {"NAME": "St. Louis City"},
    "0500000US29186": {"NAME": "Ste Genevieve"},
    "0500000US51067": {"NAME": "Franklin County"},
}

def update_geojson_feature_properties(feature, properties):
    feature['properties'].update(properties)

for feature in county_geo_data['features']:
    census_area = feature['properties'].get('GEO_ID', '')
    if census_area in name_mapping:
        update_geojson_feature_properties(feature, name_mapping[census_area])

geojson_names = set(feature['properties']['NAME'] for feature in county_geo_data['features'])
df_names = set(df['county_name'])
missing_names = df_names - geojson_names
print("Missing county names:", missing_names)
Missing county names: set()

Note, I want to change the cleaning to be in methods and create a separate section for the exploration called Results. I made the mistake of not doing this prior, but I will switch these up in the final project.

Discussion

This section will demonstrate more data visualizations and include analysis of such to capture the big picture and findings towards the research question now that we understand the dataset.

[Include Introduction Here, Including Hypothesis]

Data Visualization #1

df.plot.scatter(x='ec_county', y='exposure_grp_mem_county')
plt.xlabel("Economic Connectedness between low and high SES")
plt.ylabel("Exposure of Population to low and high SES")
plt.title("Relationship between Exposure to Connection between low and high SES")
plt.show()

Include analysis here.

Data Visualization #2

df.plot.scatter(x='ec_county', y='clustering_county')
plt.xlabel("Economic Connectedness between low and high SES")
plt.ylabel("Level of Cohesiveness")
plt.title("Relationship between relationship connectedness between SES and overall cohesiveness in a county")
plt.show()

Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.

Data Visualization #3

Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.

Data Visualization #4

m = folium.Map(location=[36, -98], zoom_start=4)

folium.Choropleth(
    geo_data=county_geo_url,
    data=df,
    columns=['county_name', 'clustering_county'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd',
    name='Cohesiveness Levels', 
).add_to(m)

folium.Choropleth(
    geo_data=county_geo_url,
    data=df,
    columns=['county_name', 'ec_county'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd',
    name='Connectedness Levels', 
).add_to(m)

folium.LayerControl().add_to(m)

m
Make this Notebook Trusted to load map: File -> Trust Notebook

Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.

Data Visualization #5

m = folium.Map(location=[36, -98], zoom_start=4)

folium.Choropleth(
    geo_data=county_geo_url,
    data=df,
    columns=['county_name', 'clustering_county'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd',
    name='Cohesiveness Levels', 
).add_to(m)

folium.Choropleth(
    geo_data=county_geo_url,
    data=df,
    columns=['county_name', 'ec_high_county'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd',
    name='High-SES Connectedness', 
).add_to(m)



folium.LayerControl().add_to(m)

m
Make this Notebook Trusted to load map: File -> Trust Notebook

Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.

Data Visualization #6

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)  
plt.hist(df['clustering_county'], bins=10, range=(0.071, 0.26), edgecolor='black')
plt.xlabel("Cohesiveness")
plt.ylabel("Frequency")
plt.title("Frequency of Cohesiveness")

plt.subplot(1, 2, 2)  
plt.hist(df['volunteering_rate_county'], bins=10, range=(0.00965, 0.308), edgecolor='black')
plt.xlabel("Volunteer Rate")
plt.ylabel("Frequency")
plt.title("Frequency of Volunteer Rate")

plt.subplots_adjust(wspace=0.4)
plt.show()

Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.

Data Visualization #7

df.plot.scatter(x='clustering_county', y='volunteering_rate_county')
plt.xlabel("Cohesiveness Within County")
plt.ylabel("Volunteering Rate Within County")
plt.title("Relationship between Cohesiveness and Volunteer Rate")
plt.show()

Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.

Data Visualization #8

m = folium.Map(location=[36, -98], zoom_start=4)

folium.Choropleth(
    geo_data=county_geo_url,
    data=df,
    columns=['county_name', 'clustering_county'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd',
    name='Cohesiveness Levels', 
).add_to(m)

folium.Choropleth(
    geo_data=county_geo_url,
    data=df,
    columns=['county_name', 'volunteering_rate_county'],
    key_on='feature.properties.NAME',
    fill_color='YlOrRd',
    name='Volunteering Rate', 
).add_to(m)



folium.LayerControl().add_to(m)

m
Make this Notebook Trusted to load map: File -> Trust Notebook

Conclusion

Conclusion of discussion to be written here.